# Load the data & packages
library(tidyverse)
library(readr)
fish <- read_csv("https://Mac-STAT.github.io/data/Mercury.csv")Bootstrapping (Notes)
STAT 155
Notes
You will not be able to render this document until you’ve filled in the code!
Learning goals
Let \(\beta\) be some population parameter and \(\hat{\beta}\) be a sample estimate of \(\beta\). In order to study the potential error in \(\hat{\beta}\), you will…
explore two approaches to approximating the sampling distribution of \(\hat{\beta}\):
- Central Limit Theorem (CLT)
- bootstrapping
identify the difference between sampling and resampling
intuit how bootstrapping results can be used to make inferences about \(\beta\)
Readings and videos
Please watch the following video after class:
Exercises
Rivers contain small concentrations of mercury which can accumulate in fish. Scientists studied this phenomenon among largemouth bass in the Wacamaw and Lumber rivers of North Carolina. One goal was to explore the relationship of a fish’s mercury concentration (Concen) with its size, specifically its Length:
E(Concen | Length) = \(\beta_0\) + \(\beta_1\) Length
To this end, they caught and evaluated 171 fish, recording the following:
| variable | meaning |
|---|---|
| River | Lumber or Wacamaw |
| Station | Station number where the fish was caught (0, 1, …, 15) |
| Length | Fish’s length (in centimeters) |
| Weight | Fish’s weight (in grams) |
| Concen | Fish’s mercury concentration (in parts per million; ppm) |
Plot and model the relationship of mercury concentration with length:
fish %>%
ggplot(aes(y = Concen, x = Length)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
fish_model <- lm(Concen ~ Length, data = fish)
coef(summary(fish_model))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
## Length 0.05812749 0.005227593 11.119359 6.641225e-22Exercise 1: sample vs population
In the summary table, is the
Lengthcoefficient 0.058 the population slope \(\beta_1\) or a sample estimate \(\hat{\beta}_1\)?If it’s a sample estimate, how accurate do you think it is?
Exercise 2: The rub
Since we don’t know \(\beta_1\), we can’t know the exact error in \(\hat{\beta}_1\)! This is where sampling distributions come in. They describe how estimates \(\hat{\beta}_1\) might vary from sample to sample, thus how far these estimates might fall from \(\beta_1\):
In past activities, we used simulations to approximate the sampling distribution. For example, we took and evaluated 500 different samples of 10 counties from the population of 3142 counties. Why can’t we do that here in our fish example?
Exercise 3: CLT
In practice, we can’t observe the sampling distribution and its corresponding standard error. But we can approximate them. When our sample size n is “large enough”, we might approximate the sampling distribution using the CLT:
\[\hat{\beta}_1 \sim N(\beta_1, \text{standard error}^2)\]
The standard error in the CLT is approximated from our sample via some formula \(c / \sqrt{n}\) where “c” is complicated. Obtain and interpret this standard error from the model summary table:
coef(summary(fish_model))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
## Length 0.05812749 0.005227593 11.119359 6.641225e-22REFLECT
Great! We can approximate the sampling distribution and standard error using the CLT. BUT:
- the quality of this approximation hinges upon the validity of the Central Limit theorem which hinges upon the validity of the theoretical model assumptions
- the CLT uses complicated formulas for the standard error estimates, thus can feel a little mysterious
Let’s explore how we can use bootstrapping to complement (not entirely replace) the CLT. The saying “to pull oneself up by the bootstraps” is often attributed to Rudolf Erich Raspe’s 1781 The Surprising Adventures of Baron Munchausen in which the character pulls himself out of a swamp by his hair (not bootstraps). In short, it means to get something from nothing, through your own effort:
In this spirit, statistical bootstrapping doesn’t make any probability model assumptions. It uses only the information from our one sample to approximate standard errors.
Exercise 4: Challenge
Recall that we have a sample size of 171 fish:
nrow(fish)
## [1] 171We’ll obtain a bootstrapping distribution of \(\hat{\beta}_1\) by taking many (500) different samples of 171 fish and exploring the degree to which \(\hat{\beta}_1\) varies from sample to sample. Let’s try doing this as we did in past activities:
# Build 500 models using samples of size 171
fish_models_bad_simulation <- mosaic::do(500)*(
fish %>%
sample_n(size = 171, replace = FALSE) %>%
with(lm(Concen ~ Length))
)
head(fish_models_bad_simulation)
## Intercept Length sigma r.squared F numdf dendf .row .index
## 1 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 1
## 2 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 2
## 3 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 3
## 4 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 4
## 5 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 5
## 6 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 6What’s funny about the results? Why do you think this happened? How might you adjust the code to “fix” things?
Exercise 5: Resampling
In practice, we take one sample of size n from the population. To obtain a bootstrapping distribution of some sample estimate \(\hat{\beta}\), we…
take many resamples of size n with replacement from the sample
calculate \(\hat{\beta}\) using each resample
Let’s wrap our minds around the idea of resampling using a small example of 5 fish:
# Define data
small_sample <- data.frame(
id = 1:5,
Length = c(44, 43, 54, 52, 40))
small_sample
## id Length
## 1 1 44
## 2 2 43
## 3 3 54
## 4 4 52
## 5 5 40This sample has a mean Length of 46.6 cm:
small_sample %>%
summarize(mean(Length))
## mean(Length)
## 1 46.6- The chunk below samples 5 fish without replacement from our
small_sampleof 5 fish, and calculates their mean length. Run it several times. How do the sample and resulting mean change?
sample_1 <- sample_n(small_sample, size = 5, replace = FALSE)
sample_1
## id Length
## 1 2 43
## 2 3 54
## 3 1 44
## 4 4 52
## 5 5 40
sample_1 %>%
summarize(mean(Length))
## mean(Length)
## 1 46.6- Sampling our sample without replacement merely returns our original sample. Instead, resample 5 fish from our
small_samplewith replacement. Run it several times. What do you notice about the samples? About their mean lengths?
sample_2 <- sample_n(small_sample, size = 5, replace = TRUE)
sample_2
## id Length
## 1 4 52
## 2 1 44
## 3 2 43
## 4 1 44
## 5 5 40
sample_2 %>%
summarize(mean(Length))
## mean(Length)
## 1 44.6- Resampling our sample provides insight into the variability, hence potential error, in our sample estimates. (This works better when we have a sample bigger than 5!) As you observed in part b, each resample might include some fish from the original sample several times and others not at all. Why is this ok?
Exercise 6: Bootstrapping
We’re ready to bootstrap! Fix one line of the code below to obtain 500 bootstrap estimates of the model of Concen by Length:
# Set the seed so we get the same results
set.seed(155)
# Build 500 bootstrap models using REsamples of size 171
fish_models_bootstrap <- mosaic::do(500)*(
fish %>%
sample_n(size = 171, replace = FALSE) %>%
with(lm(Concen ~ Length))
)
head(fish_models_bootstrap)
## Intercept Length sigma r.squared F numdf dendf .row .index
## 1 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 1
## 2 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 2
## 3 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 3
## 4 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 4
## 5 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 5
## 6 -1.131645 0.05812749 0.5805245 0.4224989 123.6401 1 169 1 6Exercise 7: Bootstrap results
Recall that we started with 1 sample, thus 1 estimate of the model:
coef(summary(fish_model))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
## Length 0.05812749 0.005227593 11.119359 6.641225e-22- We now have 500 (resample) bootstrap estimates of the model. These vary around the red line in the plot below. What does the red line represent: the actual population model or
fish_model(the estimated model calculated from our originalfishsample)?
fish %>%
ggplot(aes(y = Concen, x = Length)) +
geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") +
geom_smooth(method = "lm", color = "red", se = FALSE)Now focus on just the 500 (resample) bootstrap estimates of the slope
Lengthcoefficient. Before plotting the distribution of these resampled slopes, what do you anticipate? What shape do you expect the distribution will have? Around what value do you expect it to be centered?Check your intuition with the plot below. Was your intuition right?
fish_models_bootstrap %>%
ggplot(aes(x = Length)) +
geom_density()Exercise 8: Bootstrap standard errors
Since they’re calculated from resamples of our sample, not different samples from the population, the 500 bootstrap estimates of the slope are centered around our original sample estimate. Importantly:
The degree to which the bootstrap estimates vary from the original sample estimate provides insight in the degree to which our original sample estimate might vary from the actual population slope (i.e. its standard error)!
- Use the bootstrap estimates of the
Lengthslope coefficient to approximate the standard error of 0.05813, our original sample estimate. HINT: standard deviation
# fish_models_bootstrap %>%
# ___(___(Length))- How does this bootstrapped approximation of standard error compare to that calculated via (a complicated mystery) formula and reported in the model summary table?
coef(summary(fish_model))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
## Length 0.05812749 0.005227593 11.119359 6.641225e-22Pause: Powerful stuff!
Just pause here to appreciate how awesome it is that you approximated the potential error in our sample estimates using simulation and your sample data alone – no “theorems” or complicated formulas. You might say we pulled ourselves up by the bootstraps.
Exercise 9: Looking ahead at intervals
In the past few activities, we’ve been exploring sampling variability and error. These tools are critical in using our sample to make inferences about the broader population. We’ll explore inference more formally in the weeks ahead. Here, use your intuition to apply our bootstrapping results:
fish %>%
ggplot(aes(y = Concen, x = Length)) +
geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray")fish_models_bootstrap %>%
ggplot(aes(x = Length)) +
geom_density()Our original sample estimate of the
Lengthcoefficient, 0.0581, was simply our best guess of the actual coefficient among all fish. But we know it’s wrong. Based on the plots above, provide a bigger range or interval of plausible values for the actual coefficient among all fish.We can do better than visual approximations. Use the
fish_models_bootstrapresults to provide a more specific interval of plausible values for the actualLengthcoefficient. THINK: Do you think we should use the full range of observed bootstrap estimates? Just a fraction?
# fish_models_bootstrap %>%
# summarize(___(Length, ___), ___(Length, ___))Exercise 10: Looking ahead at hypothesis testing
Some researchers claim that mercury content is associated with the length of a fish. Let’s use our bootstrapping results to test this hypothesis.
- Based on only the plot below of our bootstrap models, do you think our sample data supports this hypothesis?
fish %>%
ggplot(aes(y = Concen, x = Length)) +
geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray")- What about numerical evidence? Based on the interval you calculated in part b of the previous exercise, do you think our sample data supports this hypothesis?
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework!